Abstract. We report on a method for achieving a significant truncation
of the training space necessary for recognizing rigid 3D objects
from perspective images. Considering objects lying on a table, the 
configuration space of continuous coordinates is three-dimensional. In 
addition the objects have a few distinct support modes. We show that 
recognition using a stationary camera can be carried out by training 
each object class and support mode in a two-dimensional configuration 
space. We have developed a transformation used during recognition for 
projecting the image information into the truncated configuration space 
of the training. The new concept gives full flexibility concerning the 
position of the camera since perspective effects are treated exactly. The 
concept has been tested using 2D object silhouettes as image property 
and central moments as image descriptors. High recognition speed and 
robust performance are obtained. 

Keywords: Computer vision for flexible grasping, recognition of 3D objects,
pose estimation of rigid objects, recognition from perspective images,
robot-vision systems

1 Introduction 

We describe here a method suitable for computer-vision-based flexible grasping
by robots. We consider situations where classes of objects with known shapes,
but unknown position and orientation, are to be classified and grasped in a
structured way. Such systems have many quality measures such as recognition speed,
accuracy of the pose estimation, low complexity of training, free choice of camera 
position, generality of object shapes, ability to recognize occluded objects, and 
robustness. We shall evaluate the properties of the present method in terms of 
these quality parameters. 

Recognition of 3D objects has been widely based on establishing correspondences
between 2D features and the corresponding features on the 3D object
[1-3]. The features have been point-like, straight lines, or curved image elements.
The subsequent use of geometric invariants makes it possible to classify objects
and estimate their pose [4-8]. Another strategy is the analysis of silhouettes.
When perspective effects can be ignored, as when the objects are flat and the
camera is remote, numerous well-established methods can be employed in the
search for a match between descriptors of the recorded silhouette and those of
silhouettes in a data base [9-11]. Other methods are based on stereo vision or
structured light [1, 11-12].

In the present paper we face the following situation: 

• The rigid objects do not have visual features suitable for 2D-3D correspon- 
dence search or 2D-2D correspondence search used in stereo vision. 

• No structured light is employed. 

• The camera is not necessarily remote, so that we must take perspective 
effects into account. 

We propose here a 'brute force' method [13] in which a large number of images 
or image descriptors are recorded or constructed during training. Classification
and pose estimation are then based on a match search using the training data
base. A reduction of the configuration space of the training data base is desirable 
since it gives a simpler training process and smaller extent of the data bases. The 
novelty of the present method is the recognition based on training in a truncated 
configuration space. 

The method is based on a nonlinear 2D transformation of the image to be 
recognized. The transformation corresponds to a virtual displacement of the ob- 
ject into an already trained position relative to the camera. As relevant for many 
applications we consider objects lying at rest on a table or conveyer belt. In Sect. 
2 we describe the 3D geometry of the system. We introduce the concept of 'virtual
displacement' and define the truncated training space. Sect. 3 describes
the relevant 2D transformation and the match search leading to the classification
and pose estimation. We also specify our choice of descriptors and match
criterion in the recognition. The method has been implemented by constructing 
a physical training setup and by developing the necessary software for training 
results and recognition. Typical data of the setup and the objects tested are 
given in Sect. 4. We also present a few representative test results. 

The work does not touch upon the 2D segmentation on which the present 
method must rely. The 2D segmentation is known to be a severe bottleneck if 
the scene illumination and relative positions of the objects are inappropriate. 
In the test we used back-lighting and nonoccluded objects in order to avoid 
such problems. Therefore we cannot evaluate the system properties in the case of
complex 2D segmentation.

2 The 3D Geometry and the Truncated Training Space 

In the present work we consider a selection of physical objects placed on a
horizontal table or a conveyer belt, see Fig. 1. The plane of the table surface is
denoted $\pi$. We consider gravity and assume object structures having a discrete
number of ways - here called support modes - in which the surface points touch
the table. This means that we exclude objects that are able to roll with a
constant height of the center-of-mass. Let $i$ count the object classes and let $j$
count the support modes. With fixed $j$, each object's pose has three degrees of
freedom, e.g. $(x, y, \omega)$, where $(x, y)$ is the projection on the table plane $\pi$ of its
center-of-mass (assuming uniform mass density), and $\omega$ is the rotation angle of
the object about a vertical axis through the center-of-mass.

A scene with one or more objects placed on the table is viewed by an ideal
camera with focal length $f$, pinhole position $H$ at a distance $h$ above the table,
and an optical axis making an angle $\alpha$ with the normal to the plane $\pi$, see Fig.
1. The origin of $(x, y)$ is $H$'s projection $O$ on $\pi$, and the $y$-axis is the projection
of the optical axis on $\pi$. Let $(x, y, z)$ be the coordinates of a reference point of
the object. We introduce the angle $\phi$ defined by:

$$\cos\phi = \frac{y}{\sqrt{x^2 + y^2}}, \qquad \sin\phi = \frac{x}{\sqrt{x^2 + y^2}}. \tag{1}$$

Consider a virtual displacement of the object so that its new position is given
by:

$$(x', y', \omega') = \left(0, \sqrt{x^2 + y^2}, \omega - \phi\right) \tag{2}$$

This displacement is a rotation about a vertical axis through the pinhole. Note
that the same points on the object surface are visible from the pinhole
$H$ in the original and displaced position. The inverse transformation to be used
later is given by:

$$x = y' \sin\phi, \qquad y = y' \cos\phi, \qquad \omega = \omega' + \phi \tag{3}$$

The essential property of this transformation is that the corresponding 2D 
transformation is independent of the structure of the object. The truncation of 
the training space introduced in the present paper is based on this property. 

Fig. 1. Horizontal (A) and vertical (B) views of the system including the table, the
camera, and the object before and after the virtual displacement.
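
The virtual displacement of Eqs. (1)-(3) can be illustrated with a minimal Python sketch. The function names are hypothetical and not taken from the original implementation; angles are assumed to be in radians, and `atan2` is used so that the sign conventions of Eq. (1) hold for all x.

```python
import math

def virtual_displacement(x, y, omega):
    """Eqs. (1)-(2): rotate the pose about the vertical axis through the
    pinhole so that the object lands on the trained line x' = 0."""
    phi = math.atan2(x, y)        # sin(phi) = x / r, cos(phi) = y / r
    y_prime = math.hypot(x, y)    # distance of the object from the origin O
    return phi, (0.0, y_prime, omega - phi)

def inverse_displacement(y_prime, omega_prime, phi):
    """Eq. (3): recover the original pose from the displaced one."""
    return (y_prime * math.sin(phi),
            y_prime * math.cos(phi),
            omega_prime + phi)

# Round trip for an arbitrary example pose (x, y in mm, omega in radians).
phi, displaced = virtual_displacement(30.0, 40.0, 1.0)
restored = inverse_displacement(displaced[1], displaced[2], phi)
```

Only y' and the rotation angle remain as continuous coordinates after the displacement, which is why training can be restricted to x = 0.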

We focus on image properties condensed in binary silhouettes. Therefore, we
assume that the scene is arranged with a distinguishable background color so
that each object forms a well-defined silhouette $\Omega(i, j, x, y, \omega)$ on the camera
image. Thus, $\Omega(i, j, x, y, \omega)$ is a list of coordinates $(u, v)$ of set pixels in the
image. We assume throughout that $(u, v) = (0, 0)$ lies on the optical axis.
The task in the present project is to determine $(i, j, x, y, \omega)$ from a measurement
of an object's silhouette $\Omega_o$ and a subsequent comparison with the silhouettes
$\Omega(i, j, x = 0, y, \omega)$ recorded or constructed in a reduced configuration space. In
the data base the variables $y$ and $\omega$ are suitably discretized. Silhouettes for the
data base are either recorded by a camera using physical objects or constructed
from a CAD representation.

3 The 2D Transformation and the Match Search 

After the above-mentioned virtual 3D displacement, an image point $(u, v)$ of the
object will have the image coordinates $(u', v')$ given by:

$$u'(\phi, u, v) = \frac{f\,(u \cos\phi + v \sin\phi \cos\alpha - f \sin\phi \sin\alpha)}{u \sin\phi \sin\alpha + v (1 - \cos\phi) \sin\alpha \cos\alpha + f (\cos\phi \sin^2\alpha + \cos^2\alpha)} \tag{4}$$

$$v'(\phi, u, v) = \frac{f\,(-u \sin\phi \cos\alpha + v (\cos\phi \cos^2\alpha + \sin^2\alpha) + f (1 - \cos\phi) \sin\alpha \cos\alpha)}{u \sin\phi \sin\alpha + v (1 - \cos\phi) \sin\alpha \cos\alpha + f (\cos\phi \sin^2\alpha + \cos^2\alpha)} \tag{5}$$



This result can be derived by considering - instead of an object displacement -
three camera rotations about $H$: a tilt of angle $-\alpha$, a roll of angle $\phi$, and a tilt
of angle $\alpha$. Then the relative object-camera position is the same as if the object
were displaced according to the 3D transformation described in Sect. 2. Note
that the inverse transformation corresponds to a sign change of $\phi$.
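
As an illustration, the point transformation of Eqs. (4)-(5) can be written as a short Python function. This is a sketch under stated assumptions: the name `transform_point` is hypothetical, angles are in radians, and `f`, `u`, and `v` are expressed in the same unit (for example pixels). The camera values in the example are illustrative and not taken from the experiments.

```python
import math

def transform_point(phi, u, v, f, alpha):
    """Map an image point (u, v) to (u', v') according to Eqs. (4)-(5)."""
    sp, cp = math.sin(phi), math.cos(phi)
    sa, ca = math.sin(alpha), math.cos(alpha)
    # Common denominator of Eqs. (4) and (5).
    w = u * sp * sa + v * (1 - cp) * sa * ca + f * (cp * sa * sa + ca * ca)
    u_new = f * (u * cp + v * sp * ca - f * sp * sa) / w
    v_new = f * (-u * sp * ca + v * (cp * ca * ca + sa * sa)
                 + f * (1 - cp) * sa * ca) / w
    return u_new, v_new

# The inverse transformation is obtained by changing the sign of phi.
f_example, alpha_example = 1700.0, math.radians(25.0)   # illustrative values
p = transform_point(0.3, 120.0, -45.0, f_example, alpha_example)
q = transform_point(-0.3, p[0], p[1], f_example, alpha_example)  # back to (120, -45)
```

Two sanity checks follow directly from the formulas: for $\phi = 0$ the mapping is the identity, and for $\alpha = 0$ (camera looking straight down) it reduces to a plain in-plane rotation of the image by $\phi$.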

By transforming all points in a silhouette $\Omega$ according to (4-5), one obtains
a new silhouette $\Omega'$. Let us denote this silhouette transformation $T_\phi$, so that

$$\Omega' = T_\phi(\Omega) \tag{6}$$

The 2D center-of-mass of $\Omega$ is $(u_{cm}(\Omega), v_{cm}(\Omega))$. The center-of-mass of the
displaced silhouette $\Omega'$ is close to the transformed center-of-mass of the original
silhouette $\Omega$. In other words

$$u_{cm}(\Omega') \approx u'(\phi, u_{cm}(\Omega), v_{cm}(\Omega)) \tag{7}$$

$$v_{cm}(\Omega') \approx v'(\phi, u_{cm}(\Omega), v_{cm}(\Omega)) \tag{8}$$

This holds only approximately because the 2D transformation in (4-5) is
nonlinear. Fig. 2 shows a situation in which $\Omega'$ is a square. The black
dot in $\Omega'$ is the transformed center-of-mass of $\Omega$, which is displaced slightly
from the center-of-mass of $\Omega'$.

Fig. 2. An image component $\Omega$ and the transformed version $\Omega'$ in the case that $\Omega'$ is a
square. The values of $\alpha$ and $\phi$ are given in the upper right corner. The black dot in $\Omega$
is the 2D center-of-mass. After transformation this point has the position shown as the
black dot in $\Omega'$.



Let $\Omega_{tr} = \Omega_{tr}(i, j, y, \omega)$ be the silhouettes of the training with $x = 0$. The
training data base consists of descriptors of $\Omega_{tr}(i, j, y, \omega)$ with suitably discretized
$y$ and $\omega$. In the case of not too complex objects,

$$u_{cm}(\Omega_{tr}) \approx 0 \tag{9}$$

The object to be recognized has the silhouette $\Omega_o$. This silhouette defines an
angle $\phi_o$ given by

$$\phi_o = \arctan\left(\frac{u_{cm}(\Omega_o)}{f \sin\alpha - v_{cm}(\Omega_o) \cos\alpha}\right) \tag{10}$$

According to (4, 7, 10), the transformed silhouette $\Omega'_o = T_{\phi_o}(\Omega_o)$ has a 2D
center-of-mass close to $u = 0$:

$$u_{cm}(\Omega'_o) \approx 0 \tag{11}$$

We shall return to the approximations (7-9, 11) later.
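
The following Python sketch illustrates Eqs. (10) and (11) on a synthetic silhouette. All names and numerical values are hypothetical. The silhouette is treated as a list of pixel coordinates that are transformed point by point, without resampling onto a pixel grid, which is a simplification.

```python
import math

def transform_point(phi, u, v, f, alpha):
    """Eqs. (4)-(5), repeated here so that the example is self-contained."""
    sp, cp = math.sin(phi), math.cos(phi)
    sa, ca = math.sin(alpha), math.cos(alpha)
    w = u * sp * sa + v * (1 - cp) * sa * ca + f * (cp * sa * sa + ca * ca)
    return (f * (u * cp + v * sp * ca - f * sp * sa) / w,
            f * (-u * sp * ca + v * (cp * ca * ca + sa * sa)
                 + f * (1 - cp) * sa * ca) / w)

def centroid(pixels):
    """2D center-of-mass of a list of (u, v) coordinates."""
    n = len(pixels)
    return sum(u for u, _ in pixels) / n, sum(v for _, v in pixels) / n

def phi_from_centroid(u_cm, v_cm, f, alpha):
    """Eq. (10); atan2 agrees with arctan when the denominator is positive."""
    return math.atan2(u_cm, f * math.sin(alpha) - v_cm * math.cos(alpha))

f, alpha = 1700.0, math.radians(25.0)    # hypothetical camera values
# Hypothetical silhouette: a filled 41 x 41 pixel square centred at (300, 100).
silhouette = [(300 + du, 100 + dv)
              for du in range(-20, 21) for dv in range(-20, 21)]

u_cm, v_cm = centroid(silhouette)
phi_o = phi_from_centroid(u_cm, v_cm, f, alpha)
transformed = [transform_point(phi_o, u, v, f, alpha) for u, v in silhouette]

u_point, _ = transform_point(phi_o, u_cm, v_cm, f, alpha)  # zero up to rounding
u_sil, _ = centroid(transformed)                           # only close to zero
```

The transformed center-of-mass point lands on $u = 0$ by construction, whereas the center-of-mass of the transformed silhouette is only close to zero. That small residual is the approximation expressed by (11).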

Eqs. (9) and (11) imply that $\Omega'_o$ is to be found among the silhouettes
$\Omega_{tr}(i, j, y, \omega)$ of the data base. Because of the approximations (9, 11), the
similarity between $\Omega'_o$ and $\Omega_{tr}(i, j, y, \omega)$ is not exact with regard to translation, so
one must use translation-invariant descriptors in the comparison.

In the search for a match between $\Omega_{tr}(i, j, y, \omega)$ and $\Omega'_o$ it is convenient to
use that $v_{cm}(\Omega'_o) \approx v_{cm}(\Omega_{tr}(i, j, y, \omega))$. It turns out that $v_{cm}(\Omega_{tr}(i, j, y, \omega))$ is
usually a monotonic function of $y$, so - using interpolation - one can calculate a
data base slice with a specified value $v_{cm}$ and with $i$, $j$, $\omega$ as entries. This means
that $i$, $j$, and $\omega$ can be determined by a match search between moments of $\Omega'_o$
and moments in this data base slice. Note that the data base slice involves one
continuous variable $\omega$. With a typical step size of 3° the data base slice has only
120 records per support mode and per object class.
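
One way to picture the data base slice is the following Python sketch. The data layout is an assumption made for illustration only: a dictionary keyed by (i, j, ω) whose values are lists of (y, v_cm, descriptors) ordered in y. Linear interpolation between neighbouring records is likewise an assumption, since the interpolation scheme is not specified here.

```python
def slice_record(samples, v_target):
    """Interpolate one (i, j, omega) entry at a given v_cm.

    samples: list of (y, v_cm, descriptors) ordered in y, with v_cm
    monotonic in y. Returns (y, descriptors) at v_cm = v_target, or None
    if v_target lies outside the trained range.
    """
    for (y0, v0, d0), (y1, v1, d1) in zip(samples, samples[1:]):
        if v0 != v1 and min(v0, v1) <= v_target <= max(v0, v1):
            t = (v_target - v0) / (v1 - v0)
            y = y0 + t * (y1 - y0)
            descriptors = [a + t * (b - a) for a, b in zip(d0, d1)]
            return y, descriptors
    return None

def build_slice(database, v_target):
    """Collect the interpolated record of every (i, j, omega) key."""
    result = {}
    for key, samples in database.items():
        record = slice_record(samples, v_target)
        if record is not None:
            result[key] = record
    return result

# Hypothetical training data for one key (i=0, j=1, omega=30 degrees).
database = {(0, 1, 30.0): [(100.0, -50.0, [1.0, 2.0]),
                           (150.0, 0.0, [2.0, 4.0]),
                           (200.0, 40.0, [3.0, 5.0])]}
data_slice = build_slice(database, 20.0)
```

Because the interpolated y is stored with each record, the value of y belonging to the best match is available as soon as the match search has finished.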

The results of the search are $i_{match}$, $j_{match}$, and $\omega_{match}$. The value $y_{match}$ can
be calculated using the relation between $y$ and $v_{cm}(\Omega_{tr})$ for the relevant values of
$i$, $j$, and $\omega$. The original pose $(x, y, \omega)$ can now be found by inserting $y' = y_{match}$,
$\omega' = \omega_{match}$, and $\phi = \phi_o$ in Eq. (3).

If the approximation (9) breaks down, one must transform all the silhouettes
of the data base, so that the match search takes place between $\Omega'_o$ and
$\Omega'_{tr} = T_{\phi_{tr}}(\Omega_{tr})$ where

$$\phi_{tr} = \arctan\left(\frac{u_{cm}(\Omega_{tr})}{f \sin\alpha - v_{cm}(\Omega_{tr}) \cos\alpha}\right) \tag{12}$$

In this case $\phi = \phi_o - \phi_{tr}$ should be inserted in (3) instead of $\phi_o$.

We are left with the approximation (11), demonstrated in Fig. 2. This gives
a slightly wrong angle $\phi_o$. If the corresponding errors are harmful in the pose
estimation, then one must perform an iterative calculation of $\phi_o$ so that (11)
holds exactly.
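
The iterative scheme is not specified further. One possible realization, given here only as a hedged sketch and not as the procedure actually used, is a fixed-point iteration: Eq. (10) is applied repeatedly to the already transformed silhouette and the angles are accumulated, which is valid because rotations about the same vertical axis add up. Function names and numerical values are hypothetical.

```python
import math

def transform_point(phi, u, v, f, alpha):
    """Eqs. (4)-(5), repeated here so that the example is self-contained."""
    sp, cp = math.sin(phi), math.cos(phi)
    sa, ca = math.sin(alpha), math.cos(alpha)
    w = u * sp * sa + v * (1 - cp) * sa * ca + f * (cp * sa * sa + ca * ca)
    return (f * (u * cp + v * sp * ca - f * sp * sa) / w,
            f * (-u * sp * ca + v * (cp * ca * ca + sa * sa)
                 + f * (1 - cp) * sa * ca) / w)

def centroid(pixels):
    n = len(pixels)
    return sum(u for u, _ in pixels) / n, sum(v for _, v in pixels) / n

def refine_phi(pixels, f, alpha, tol=1e-9, max_iter=20):
    """Find phi such that the transformed silhouette has u_cm = 0."""
    phi = 0.0
    for _ in range(max_iter):
        moved = [transform_point(phi, u, v, f, alpha) for u, v in pixels]
        u_cm, v_cm = centroid(moved)
        if abs(u_cm) < tol:
            break
        # Eq. (10) applied to the current silhouette gives the correction.
        phi += math.atan2(u_cm, f * math.sin(alpha) - v_cm * math.cos(alpha))
    return phi

f, alpha = 1700.0, math.radians(25.0)    # hypothetical camera values
silhouette = [(300 + du, 100 + dv)
              for du in range(-20, 21) for dv in range(-20, 21)]
phi_exact = refine_phi(silhouette, f, alpha)
```

The first pass of the loop reproduces the estimate of Eq. (10); the later passes only add small corrections, so a few iterations are expected to suffice for compact silhouettes.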

We conclude this section by specifying our choice of 1) image descriptors
used in the data base, and 2) recognition criterion. In our test we have used as
descriptors the 8-12 lowest order central moments, namely $\mu_{00}$, $\mu_{20}$, $\mu_{11}$, $\mu_{02}$, $\mu_{30}$,
$\mu_{21}$, $\mu_{12}$, and $\mu_{03}$. The first order moments are absent since we use
translation-invariant moments. In addition we used, in some of the tests, the width and
height of the silhouette. In the recognition strategy we minimized the Euclidean
distance in a feature space of descriptors. The descriptors were normalized in
such a way that the global variance of each descriptor was equal to one [14].
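
A minimal Python sketch of these two ingredients is given below. The names are hypothetical, the data base is assumed to be a dictionary from a key such as (i, j, ω) to a descriptor list, and the normalization divides each descriptor by its standard deviation over all records, which is one straightforward way of obtaining unit global variance.

```python
import math

# Orders (p, q) of the eight central moments listed above.
ORDERS = [(0, 0), (2, 0), (1, 1), (0, 2), (3, 0), (2, 1), (1, 2), (0, 3)]

def central_moments(pixels):
    """Central moments mu_pq of a silhouette given as (u, v) pixel list."""
    n = len(pixels)
    uc = sum(u for u, _ in pixels) / n
    vc = sum(v for _, v in pixels) / n
    return [sum((u - uc) ** p * (v - vc) ** q for u, v in pixels)
            for p, q in ORDERS]

def descriptor_scales(records):
    """Standard deviation of each descriptor over all data base records."""
    scales = []
    for column in zip(*records.values()):
        mean = sum(column) / len(column)
        variance = sum((c - mean) ** 2 for c in column) / len(column)
        scales.append(math.sqrt(variance) or 1.0)   # guard against zero
    return scales

def best_match(query, records, scales):
    """Key of the record with minimum normalized Euclidean distance."""
    def distance(descriptors):
        return math.sqrt(sum(((a - b) / s) ** 2
                             for a, b, s in zip(query, descriptors, scales)))
    return min(records, key=lambda key: distance(records[key]))

# Central moments do not change when the silhouette is translated.
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
shifted = [(u + 10, v - 7) for u, v in square]
same = central_moments(square) == central_moments(shifted)
```

With the slice records as `records`, the key returned by `best_match` plays the role of the matched class, support mode, and rotation angle.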



4 The Experiments 

Fig. 3 shows the setup for training and test. We used a rotating table for scanning
through the training parameter $\omega$ and a linear displacement of the camera for
scanning through the parameter $y$. The pose estimation was checked by a grasping
robot. In order to avoid 2D segmentation problems we used backlighting in
both training and recognition.

The parameters for the setup and typical objects tested are shown in
Table 1.

We report here the result of a test of a single object, a toy rabbit manufactured
by LEGO®, see Fig. 4. Its 5 support modes are shown along with the
support mode index used. After training using an angular step size of $\Delta\omega$ = 4°,
we placed the rabbit with one particular support mode in 100 random poses, i.e.
values of $(x, y, \omega)$, in the field of view. The support modes detected by the vision
system were recorded. This test was repeated for the remaining support modes.
The results are shown in Table 2. The two confused support modes were
'support by four paws' and 'support by one ear and fore paws'. It can be understood
from Fig. 4 that these two support modes are most likely to be mixed up
in a pose estimation. We repeated the experiment with $\Delta\omega$ = 7.2°. In this case
no errors were measured in the 500 tests.

Fig. 3. Setup for training and test. The calibration template is used for calibrating the
camera relative to a global coordinate system.

Table 1. Properties and parameters of the objects and the test setup.

Angle $\alpha$                                            25°
Height $h$                                           800 mm
Field of view                                        400 mm x 300 mm
$\Delta\omega$ = angular step size during training        4°-7.2°
$\Delta y$ = translational step size during training     50 mm
Camera resolution (pixels)                           768 x 576
Number of support modes of objects                   3-5
Typical linear object dimensions                     25-40 mm
Typical linear dimensions of object images           40-55 pixels
Silhouette descriptors                               8 lowest order central moments
                                                     + width & height of silhouette
Number of data base records per object               900-1500 for $\Delta\omega$ = 7.2°
Training time per support mode                       5 min.
Recognition time (after 2D segmentation)             5-10 ms

Fig. 4. The toy rabbit shown in its five support modes. The support mode indices used
in Table 2 are written in the upper left corner.

Table 2. Statistics of the support mode detection when the toy rabbit was placed at
random positions $(x, y, \omega)$ in the field of view. Each support mode was tested 100 times
and the experiments involved two different angular step sizes in the training.

                                Angular step size 7.2°            Angular step size 4°
True \ Detected              0     1     2     3     4       0     1     2     3     4
0 standing                 100     -     -     -     -     100     -     -     -     -
1 lying on left side         -   100     -     -     -       -   100     -     -     -
2 lying on right side        -     -   100     -     -       -     -   100     -     -
3 on fore & hind paws        -     -     -    98     2       -     -     -   100     -
4 on ear & fore paws         -     -     -     1    99       -     -     -     -   100

5 Discussion 

A complete vision system for flexible grasping consists of two processes, one
performing the 2D segmentation, and one giving the 3D interpretation. We have
developed a method to be used in the second component only, since we used an
illumination and object configuration giving very simple and robust segmentation.

The method developed is attractive with respect to the following aspects: 

• High speed of the 3D interpretation. 

• Generality concerning the object shape. 

• Flexibility of camera position and object shapes, since tall objects, closely 
positioned cameras, and oblique viewing directions are allowed. In case of 
ambiguity in the pose estimation when viewed by a single camera, it is easy 
to use 2 or more cameras with independent 3D interpretation. 

• Simple and fast training without assistance from vision experts. 

The robustness and total recognition speed depend critically on the 2D
segmentation, and so we cannot draw conclusions on these two quality parameters. The
method in its present form is not suitable for occluded objects.

One remaining property to be discussed is the accuracy of the pose estimation.
In our test the grasping uncertainty was about ±2 mm and ±3°.
However, the origin of these uncertainties was not traced, so they may be
reduced significantly by careful camera-robot co-calibration.

In our experiments we used a rather coarse discretization of $y$ and $\omega$, and
only one object at a time was recognized. The recognition time in the experiment
was typically 20 ms per object (plus segmentation time). This short processing
time gives plenty of room for more demanding tasks involving more objects,
more support modes, and higher accuracy through a finer discretization of $y$ and
$\omega$.

6 Conclusion 

We have developed and tested a computer vision concept appropriate in a brute 
force method based on data bases of image descriptors. We have shown that a 
significant reduction of the continuous degrees of freedom necessary in the train- 
ing can be achieved by applying a suitable 2D transformation during recognition 
prior to the match search. The advantages are the reductions of the time and 
the storage used in the training process. 

The prototype developed will be used for studying a number of properties 
and possible improvements. First, various types of descriptors and classification 
strategies will be tested. Here, color and gray tone information should be in- 
cluded. Second, the over-all performance with different 2D segmentation strate- 
gies will be studied, particularly those allowing occluded objects. Finally, the 
concept of training space truncation should be extended to systems recognizing 
objects of arbitrary pose. 